Abstract: Web pages are organized in a clearly defined manner, and users can identify the useful information on a page. For the user, the useful information is the relevant content, and the remaining information is noise. Users extract the useful information from a web page on the basis of the page template. Data mining on the Web thus becomes an important task for discovering useful knowledge or information from the Web. However, useful information on the Web is often accompanied by a large amount of noise such as banner advertisements, navigation bars, copyright notices, etc. Although such information items are functionally useful for human viewers and necessary for Web site owners, they often hamper automated information gathering and Web data mining, e.g., Web page clustering, classification, artificial intelligence, and information extraction. The proposed approach to reducing web page noise is a hybrid of Latent Semantic Analysis (LSA) and a Naive Bayesian classifier. LSA is used to analyze World Wide Web documents or web pages. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.
Keywords: Web Page Purification, Information Extraction, DOM Tree.
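The closed-form, linear-time training of a Naive Bayes classifier mentioned in the abstract can be illustrated with a minimal sketch. It is not the implementation used in this work: the LSA step is omitted, the page blocks and their "content"/"noise" labels are hypothetical toy data, and the names `train_naive_bayes` and `classify` are illustrative only. Laplace smoothing (`alpha`) is added as an assumption so that unseen words do not yield zero probabilities, which makes the estimate a smoothed variant of the maximum-likelihood one.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: text blocks from web pages, labelled by hand.
TRAINING_BLOCKS = [
    ("latest research on web data mining methods", "content"),
    ("the classifier separates informative text from templates", "content"),
    ("experiments show improved clustering of web pages", "content"),
    ("home about contact login sitemap", "noise"),
    ("copyright all rights reserved privacy policy", "noise"),
    ("advertisement click here to buy now", "noise"),
]


def train_naive_bayes(blocks, alpha=1.0):
    """Estimate parameters by counting in a single pass over the data.

    No iterative optimisation is needed: the estimates are closed-form
    ratios of counts, so training time is linear in the number of tokens.
    """
    class_counts = Counter()            # number of blocks per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for text, label in blocks:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total_blocks = sum(class_counts.values())
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        "log_prior": {c: math.log(n / total_blocks) for c, n in class_counts.items()},
        "totals": {c: sum(word_counts[c].values()) for c in class_counts},
    }


def classify(model, text):
    """Return the class with the highest log posterior for a text block."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # ignore words never seen during training
            count = model["word_counts"][label][token]
            # Smoothed estimate of P(token | label).
            score += math.log((count + alpha) / (model["totals"][label] + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAINING_BLOCKS)
print(classify(model, "copyright privacy policy"))
print(classify(model, "web data mining research"))
```

In a hybrid along the lines the abstract describes, the word counts would presumably be replaced or complemented by LSA-derived features; how the two are combined is not specified here, so the sketch stays with raw word counts.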